# BTS 510 Lab 3

```{r}
#| label: setup
library(tidyverse)
theme_set(theme_classic(base_size = 16))
```
## Learning objectives

* Understand **probability** as a measure of **uncertainty**
* **Summarize** probabilities with *probability distributions*
* Think about **uncertainty** in sampling
* **Summarize** uncertainty in sampling: **Sampling distribution**
## Mathematically-derived sampling distribution

* Many combinations of **population distribution** and **test statistic** have *mathematically-derived* sampling distributions
    * For example, with a **normally distributed** population ($\mathcal{N}(\mu, \sigma^2)$), the sampling distribution of the **mean** is $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$
    * In this case, the *population* distribution and *sampling* distribution are the **same distribution**: Normal
    * The variance (and thus the standard deviation) of the sampling distribution **decreases** as $n$ increases: a *larger sample* leads to a *smaller standard error*, i.e., a more precise estimate
::: {.callout-note}
* There is not a mathematically-derived sampling distribution for every population distribution and test statistic
* If you're dealing with *unknown or unusual distributions or test statistics*, your only option may be *computational methods*
:::
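To make the formula concrete, here is the standard error worked out for a population with $\sigma = 1$ at the three sample sizes used later in this lab:

$$
SE(\bar{X}) = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}
\qquad
\begin{aligned}
n &= 10: & SE &= 1/\sqrt{10} \approx 0.316 \\
n &= 50: & SE &= 1/\sqrt{50} \approx 0.141 \\
n &= 100: & SE &= 1/\sqrt{100} = 0.100
\end{aligned}
$$

These are the values the simulated standard deviations of the sample means should approximate.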
## Computationally-derived sampling distribution

### Getting started
* Set a **random number seed**
    * If you don't do this, every time you run the code, you'll get different numbers
    * Setting a random seed makes your simulation *replicable*
    * (I normally set a seed at the start of each document, but I moved it down here because it's very important in this lab.)
* Load (and install, if needed) the [**infer** package](https://infer.tidymodels.org/)
    * **infer** has a function that we'll use to *repeatedly sample*
    * There are other ways to do this, but this is very simple and works well
* You can install a new package a few different ways
    * Via the Console: `install.packages("infer")`
    * Via the "Packages" pane (bottom right): Click the "Install" button, type in the package name, and click "Install"
* Install only **once**; load with `library()` **once per document**
```{r}
set.seed(12345)
library(infer)
```
### Standard Normal population: $population \sim \mathcal{N}(0, 1)$

#### Create a population
* `rnorm()` creates a random variable that follows a **normal** distribution
    * First argument = Number of observations (population size)
    * Second argument = $\mu$
    * Third argument = $\sigma$ (not $\sigma^2$)
* Create a very large population (N = 1,000,000) with a specified mean and variance
    * Confirm that the mean and SD are what you think they should be
```{r}
norm_pop <- rnorm(1000000, 0, 1)
mean(norm_pop)
#> [1] 0.0002941277
sd(norm_pop)
#> [1] 1.001315
var(norm_pop)
#> [1] 1.002632
```
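As a quick check on those estimates: for a standard Normal population, the true values are

$$
\mu = 0, \qquad \sigma = 1, \qquad \sigma^2 = 1
$$

so the simulated values above ($\approx 0.0003$, $\approx 1.0013$, $\approx 1.0026$) are very close to their targets. Also note that because `rnorm()` takes $\sigma$ rather than $\sigma^2$, simulating a population with variance $\sigma^2 = 4$ would require passing $\sqrt{4} = 2$ as the third argument.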
#### Sample from the population
* Small sample: $n$ = 10
    * `rep_sample_n()` repeatedly samples from the dataset
    * Piped (`%>%`) from the `norm_pop` dataset, so it resamples from that
    * `size = 10`: Sample size for each sample
    * `reps = 10000`: Number of repeated samples (could be larger, but this is good enough to show how things work and doesn't take too long to run)
    * `replace = TRUE`: Sample with replacement, so observations go back in after each sample and can be drawn again in the next sample
    * `summarise(x_bar = mean(norm_pop))`: Create a data frame with **the mean** of **each sample**
        * There are 10,000 samples, so there will be 10,000 means in this data frame
    * The `ggplot()` code creates a histogram from the data
        * Modification: `bins = 50` increases the number of bins to make the histogram a little smoother
        * Modification: `xlim(-1.5, 1.5)` sets limits on the x-axis to make the plots more comparable
        * I had to run the plots first without this to see how far the x-axis needed to go. So for later plots here, run without it, look at the result, then add it with appropriate numbers to put all the plots on the same scale
        * Make similar modifications in your own plots later in this document
    * We'll spend the next 2 weeks on `ggplot()`, so you'll learn all of this (and more!)
```{r}
norm_means10 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))
ggplot(data = norm_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
```

* Moderate sample: $n$ = 50

```{r}
norm_means50 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 50, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))
ggplot(data = norm_means50, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
```

* Large sample: $n$ = 100

```{r}
norm_means100 <- data.frame(norm_pop) %>%
  rep_sample_n(size = 100, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(norm_pop))
ggplot(data = norm_means100, aes(x = x_bar)) +
  geom_histogram(bins = 50) +
  xlim(-1.5, 1.5)
```

#### Compare estimates

* Compare means for each sample size

```{r}
mean(norm_means10$x_bar)
mean(norm_means50$x_bar)
mean(norm_means100$x_bar)
```

* Compare standard deviations for each sample size

```{r}
sd(norm_means10$x_bar)
sd(norm_means50$x_bar)
sd(norm_means100$x_bar)
```

* Compare variances for each sample size

```{r}
var(norm_means10$x_bar)
var(norm_means50$x_bar)
var(norm_means100$x_bar)
```

* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution?
    * $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$

### Bernoulli population: $population \sim B(0.3)$

#### Create a population

* `rbinom()` creates a random variable that follows a **binomial** distribution
    * First argument = Number of observations (population size)
    * Second argument = number of trials (here, just 1 for a **Bernoulli** trial)
    * Third argument = $p$ (probability of success)
* Create a very large population (N = 1,000,000) with a specified mean and variance
    * Confirm that the mean and SD are what you think they should be

```{r}
bern_pop <- rbinom(1000000, 1, 0.3)
mean(bern_pop)
sd(bern_pop)
var(bern_pop)
```

#### Sample from the population

::: {.callout-note}
* Remember to adjust the x-axis using `xlim()` (as in the Normal section) to have the same range for each plot.
* You'll need to make each plot, see what they look like, and then adjust the x-axis accordingly.
:::

* Small sample: $n$ = 10

```{r}
bern_means10 <- data.frame(bern_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(bern_pop))
ggplot(data = bern_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50)
```

* Moderate sample: $n$ = 50

```{r}
```

* Large sample: $n$ = 100

```{r}
```

#### Compare estimates

* Compare means for each sample size

```{r}
```

* Compare standard deviations for each sample size

```{r}
```

* Compare variances for each sample size

```{r}
```

* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution **for a normally distributed population**?
    * $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$

### Poisson population: $population \sim Poisson(1.5)$

#### Create a population

* `rpois()` creates a random variable that follows a **Poisson** distribution
    * First argument = Number of observations (population size)
    * Second argument = $\lambda$ (mean and variance)
    * Remember that $\lambda$ is the variance, but we're going to look at the **standard deviation**, which is the *square root* of the variance

```{r}
pois_pop <- rpois(1000000, 1.5)
mean(pois_pop)
sd(pois_pop)
var(pois_pop)
```

#### Sample from the population

::: {.callout-note}
* Remember to adjust the x-axis using `xlim()` (as in the Normal section) to have the same range for each plot.
* You'll need to make each plot, see what they look like, and then adjust the x-axis accordingly.
:::

* Small sample: $n$ = 10

```{r}
pois_means10 <- data.frame(pois_pop) %>%
  rep_sample_n(size = 10, reps = 10000, replace = TRUE) %>%
  summarise(x_bar = mean(pois_pop))
ggplot(data = pois_means10, aes(x = x_bar)) +
  geom_histogram(bins = 50)
```

* Moderate sample: $n$ = 50

```{r}
```

* Large sample: $n$ = 100

```{r}
```

#### Compare estimates

* Compare means for each sample size

```{r}
```

* Compare standard deviations for each sample size

```{r}
```

* Compare variances for each sample size

```{r}
```

* How does each value correspond to what you'd expect based on the mathematically-derived sampling distribution **for a normally distributed population**?
    * $\bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n})$

## Summarize results

* In general, how does **sample size** impact the **standard error**?
    * i.e., the standard deviation of the sampling distribution?
    * Do larger samples lead to smaller or larger standard errors?
* In general, how does the **population distribution** impact the **sampling distribution**?
    * Based on: Visual inspection of the distribution
    * Based on: Estimates of standard error
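For reference when answering these questions (these values are standard results, not part of the lab code), the theoretical mean, variance, and standard deviation of the two non-Normal populations used above are:

$$
\begin{aligned}
\text{Bernoulli}(p = 0.3):&\quad \mu = p = 0.3, & \sigma^2 &= p(1-p) = 0.21, & \sigma &\approx 0.458 \\
\text{Poisson}(\lambda = 1.5):&\quad \mu = \lambda = 1.5, & \sigma^2 &= \lambda = 1.5, & \sigma &\approx 1.225
\end{aligned}
$$

In each case the standard error of the mean is $\sigma / \sqrt{n}$, so these $\sigma$ values can be compared against the simulated standard deviations of `x_bar` at each sample size.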